Script Identification from Bilingual Gujarati-English Documents
Authors
Abstract
In a multilingual country like India, English words are often interspersed with text in Indian regional languages in official papers, school textbooks, and magazines. A bilingual Optical Character Recognition (OCR) system is therefore needed that can recognize such documents and store them for future use. In this paper, the authors present a script identification system for printed documents containing the Gujarati and Roman scripts. Identification is performed line by line. The spatial spread of pixels in the upper and lower parts of a text line, which differs between Gujarati and English, is used to identify the script. Horizontal projection is used to distinguish lines belonging to different scripts. A K-nearest neighbour classifier then assigns 2000 text lines to one of two scripts, English or Gujarati, based on 4 spatial spread features extracted using connected components and horizontal projection. The proposed algorithm achieves an average classification accuracy as high as 99.70% for bi-script separation.
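The pipeline described in the abstract (horizontal projection to find text lines, spatial spread features per line, K-nearest neighbour classification) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the 4 features, so the two features below (the share of a line's ink in its upper and lower thirds) are hypothetical stand-ins, and the connected-component step is omitted. The function names, the toy image, and the training vectors are all invented for illustration.

```python
from collections import Counter
from math import dist


def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image):
    """Return (start, end) row ranges of text lines, end exclusive.

    A line is a maximal run of rows whose projection is non-zero.
    """
    lines, start = [], None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines


def zone_features(image, start, end):
    """Hypothetical features: share of the line's ink in its upper and lower thirds."""
    profile = horizontal_projection(image[start:end])
    total = sum(profile)  # > 0 because a segmented line has only non-empty rows
    third = max(1, len(profile) // 3)
    return (sum(profile[:third]) / total, sum(profile[-third:]) / total)


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to the query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy binary page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]

# Invented training vectors with neutral labels; not measured from real text.
train = [
    ((0.7, 0.1), "script_a"),
    ((0.6, 0.2), "script_a"),
    ((0.1, 0.7), "script_b"),
    ((0.2, 0.6), "script_b"),
]

for start, end in segment_lines(page):
    features = zone_features(page, start, end)
    print((start, end), knn_predict(train, features))
```

On real scans, the projection threshold would likely need to tolerate noise rather than require exactly zero ink between lines, and the feature set would have to be replaced by the one defined in the full paper.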
Similar Resources
Character Level Separation and Identification of English and Gujarati Digits from Bilingual (English-Gujarati) Printed Documents
Nowadays, English script is often interspersed within text in Indian languages. So there is a need for an optical character recognition (OCR) system which can recognize these bilingual documents and store them for future use. Hence, in this paper an OCR system is proposed that can read documents containing Gujarati and English scripts (digits only). These scripts have many features in ...
Identification of Printed Punjabi Words and English Numerals Using Gabor Features
Script identification is one of the challenging steps in the development of an optical character recognition system for bilingual or multilingual documents. In this paper, an attempt is made to identify English numerals at word level in Punjabi documents by using Gabor features. The support vector machine (SVM) classifier with five-fold cross-validation is used to classify the word imag...
Wavelet Packet Based Texture Features for Automatic Script Identification
In a multi-script environment, an archive of documents printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...
A Comparative Analysis of Classifiers Accuracies for Bilingual Printed Documents (Oriya-English)
Bilingual document recognition has been the subject of intensive research, and our focus is on the recognition of Oriya-English bilingual documents. In most of our official papers and school textbooks, it is observed that English words are interspersed within the Indian languages. So there is a need for an Optical Character Recognition (OCR) system which can recognize these bilingual documents and st...
Morphological Reconstruction for Word Level Script Identification
A line of a bilingual document page may contain text words in regional language and numerals in English. For Optical Character Recognition (OCR) of such a document page, it is necessary to identify different script forms before running an individual OCR system. In this paper, we have identified a tool of morphological opening by reconstruction of an image in different directions and regional de...